Libraries

Helper functions

Importing data

Exploratory Data Analysis

By examining the features, we want to learn how the following ones affect the sale price of a property:

We also want to build a pricing model based on the above features.

Investigating missing values

Let's examine the categorical variables to figure out what to do with their missing values.

Observations in the features ['Street','Utilities','Condition2','RoofMatl'] are concentrated in a single category, so these features add almost no information to our data. I'll delete them.
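A minimal sketch of how this concentration check could look. The toy frame, the 0.95 cut-off and the helper name dominant_share are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

def dominant_share(df: pd.DataFrame, columns) -> pd.Series:
    """For each column, return the share of rows held by its most frequent category."""
    return pd.Series(
        {col: df[col].value_counts(normalize=True, dropna=False).iloc[0] for col in columns}
    )

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "MSZoning": ["RL"] * 60 + ["RM"] * 40,
})

shares = dominant_share(toy, ["Street", "MSZoning"])
threshold = 0.95  # assumed cut-off, not taken from the notebook
to_drop = shares[shares > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(shares)
print("dropped:", to_drop)
```

With a rule like this, the choice of threshold is a judgment call: a lower value removes more features but may discard a rare category that still matters for price.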

Next, let's explore the numerical features

Based on basic exploration of numerical features:

So, I will delete the above features

['MSSubClass','OverallQual','OverallCond','BsmtFullBath','FullBath','HalfBath','BedroomAbvGr','Fireplaces','GarageCars']

So in our reduced dataset, we have a total of 62 features (19 numerical and 43 categorical)

Handling missing values

Numerical Features

I will fill missing values with the most frequent value (the mode) of each feature
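As a rough sketch, mode imputation with pandas could look like this. The helper fill_with_mode and the toy values are hypothetical; only the idea of filling with the most frequent value comes from the text above.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy where NaNs in each listed column are replaced by that column's mode."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)  # sorted, so ties resolve to the smallest value
        if not modes.empty:  # an all-NaN column has no mode; leave it untouched
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Made-up values for a single numerical feature.
toy = pd.DataFrame({"LotFrontage": [60.0, 60.0, np.nan, 80.0, np.nan]})
filled = fill_with_mode(toy, ["LotFrontage"])
print(filled)
```

Working on a copy keeps the original frame available, which is handy for comparing distributions before and after imputation.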

Categorical Features

Handling Outliers

There are many techniques to detect and handle outliers. I've written code that runs 6 different outlier detection techniques and returns the list of outliers that are confirmed by all of these methods.
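The consensus idea can be sketched as follows. This is only an illustration with three simple detectors (z-score, IQR fences, and a MAD-based modified z-score); the six techniques actually used are not listed here, so the detectors, their thresholds and the toy values are all assumptions.

```python
import pandas as pd

def zscore_mask(x: pd.Series, k: float = 3.0) -> pd.Series:
    """Flag values more than k standard deviations from the mean."""
    return (x - x.mean()).abs() > k * x.std(ddof=0)

def iqr_mask(x: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def mad_mask(x: pd.Series, k: float = 3.5) -> pd.Series:
    """Flag values whose modified z-score (based on the median absolute deviation) exceeds k."""
    med = x.median()
    mad = (x - med).abs().median()
    if mad == 0:
        return pd.Series(False, index=x.index)
    return (0.6745 * (x - med).abs() / mad) > k

def consensus_outliers(x: pd.Series, detectors) -> pd.Series:
    """Keep only the points that every detector flags."""
    votes = sum(d(x).astype(int) for d in detectors)
    return votes == len(detectors)

# Made-up values: 30 ordinary observations plus one extreme one.
values = pd.Series(list(range(90, 120)) + [1000], name="LotArea")
is_outlier = consensus_outliers(values, [zscore_mask, iqr_mask, mad_mask])
outlier_idx = values.index[is_outlier.to_numpy()].tolist()
cleaned = values[~is_outlier]
print(outlier_idx)
```

Requiring agreement from every detector is conservative: it removes fewer rows than any single method would, at the cost of leaving borderline points in the data.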

I think that [LotArea] still has some outliers, so I'll handle them again

Let's check how much data we lost after completing our preprocessing stage

We lost 8% of the data

Data Analysis

There are too many variables to inspect, so let's filter the features which have high correlation

Let's check the correlation between the above variables

I think YearRemodAdd will be more representative of SalePrice than YearBuilt. I will drop the following columns: YearBuilt / GarageYrBlt, since they don't add any new value to our dataset
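One way to do this filtering, assuming "high correlation" means correlation with SalePrice (the text does not say so explicitly). The 0.5 threshold, the helper name and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd

def features_correlated_with(df: pd.DataFrame, target: str, threshold: float = 0.5) -> pd.Series:
    """Return numeric features whose absolute Pearson correlation with `target` is >= threshold."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    strong = corr[corr.abs() >= threshold]
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)

# Synthetic data: one feature drives the target, one is pure noise, one is non-numeric.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "MiscVal": rng.normal(size=n),
    "Street": ["Pave"] * n,
    "SalePrice": 3 * area + rng.normal(scale=0.5, size=n),
})

selected = features_correlated_with(toy, "SalePrice")
print(selected)
```

Pearson correlation only captures linear relationships, so a feature dropped by this filter can still be useful to a tree-based model.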

Questions which may help us get insights from our dataset

1- Does property type [MSSubClass] have an effect on [SalePrice]?
2- Relationship between MSZoning and SalePrice
3- Relationship between Neighborhood, MSSubClass, MSZoning and SalePrice
4- Relationship between LotArea, LotFrontage and SalePrice
5- Relationship between OverallQual and OverallCond
6- Relationship between OverallQual and OverallCond and YearBuilt and YearRemodAdd
7- When do they remodel the property?
8- Relationship between YearBuilt / YearRemodAdd and SalePrice
9- Relationship between BsmtFinSF1, BsmtFinSF2, BsmtUnfSF and TotalBsmtSF
10- Is there any relation between Bsmt and LotArea or SalePrice?
11- Is SalePrice affected by the month sold over the years?
12- Does Condition1 affect any feature?
13- RoofStyle and LotArea or SalePrice
14- SaleType and SaleCondition
15- Do utilities like Heating, HeatingQC, CentralAir, Electrical affect SalePrice, SaleType?
16- Can we combine the above utility features into a new one?

Q1

Answer
• 50% of observations are in categories 20 and 60, and they contain most of the outliers
• All categories share the same range of SalePrice

Q2

Answer
• 93% of observations are in 2 categories, RL and RM
• All categories share the same range of SalePrice

Q3

Answer
• Neighborhoods [NridgHt, StoneBr and NoRidge] have a higher range of SalePrice, while the rest of the neighborhoods share the same range.
• MSZoning:

Q4

Answer
• In general, there is no correlation between LotArea, LotFrontage and SalePrice, even when we break the data down by Neighborhood or MSZoning

Q5

Answer
• There is no relation between OverallQual and OverallCond

Q7

Answer
From the heatmap and our earlier analysis, we can notice the following:

• In 53% of the data, YearBuilt and YearRemodAdd are the same, which is not logical; maybe YearRemodAdd is used to describe the last time the property was built/renewed
• The remodeling period ranges from 20 to 100 years

Q8

Answer
From the heatmap and our earlier analysis, we can notice the following:

• There is a positive correlation between YearRemodAdd and SalePrice
• I think YearRemodAdd will be more representative of SalePrice than YearBuilt. I will drop the following columns: YearBuilt / GarageYrBlt, since they don't add any new value to our dataset
• There is a positive correlation, with a wide spread, between YearBuilt and SalePrice
• Warning: in 53% of the data, YearBuilt and YearRemodAdd are the same, which is not logical; maybe YearRemodAdd is used to describe the last time the property was built/renewed
• In 60% of the data, GarageYrBlt is the same as the property's YearBuilt, which makes sense. However, there are 17 observations where the garage was built before the property
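The year comparisons in the bullets above can be computed with simple boolean masks. A small sketch on made-up rows (the column names are the dataset's; the values and the resulting shares are not):

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "YearBuilt":    [1990, 2005, 1950, 1978],
    "YearRemodAdd": [1990, 2005, 1995, 1978],
    "GarageYrBlt":  [1990, 2005, 1940, 1980],
})

# The mean of a boolean mask is the share of rows where the condition holds.
same_remod = (toy["YearBuilt"] == toy["YearRemodAdd"]).mean()
same_garage = (toy["GarageYrBlt"] == toy["YearBuilt"]).mean()

# Rows where the garage predates the house, worth inspecting by hand.
garage_first = toy[toy["GarageYrBlt"] < toy["YearBuilt"]]

print(same_remod, same_garage, len(garage_first))
```

Note that rows with a missing GarageYrBlt compare as False in both masks, so they count as neither "same year" nor "garage first".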

Q9

Answer

Q10

Answer
• There is a positive correlation between BsmtFinSF1 and SalePrice, and no correlation with LotArea

Q11

Answer
• Most sales happen in the 2nd and 3rd quarters of the year
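A quick way to get sales per quarter from a month column. I am assuming the month is stored as an integer 1 to 12 in a column named MoSold; the values below are made up.

```python
import pandas as pd

# Made-up months of sale (1 = January, ..., 12 = December).
toy = pd.DataFrame({"MoSold": [1, 4, 5, 6, 7, 8, 9, 11]})

# Map months 1-3 to quarter 1, 4-6 to quarter 2, and so on.
quarter = (toy["MoSold"] - 1) // 3 + 1
share = quarter.value_counts(normalize=True).sort_index()

print(share)
```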

Q12

Answer
• I think Condition1 is important for our model

Q13

Answer
• No obvious relation

Q14

Answer
• Only properties with SaleType New are sold with SaleCondition Partial

Q15

Answer
• Heating & HeatingQC: for heating types other than GasA, the range of SalePrice is very limited
• CentralAir: if a property doesn't have CentralAir, it will not have GasA heating and its price will drop dramatically
• Electrical: SalePrice will drop if the electrical system is other than SBrkr

Q16

Answer

Machine learning

Modeling raw data

Without any feature engineering yet, and modelling only the numerical features, we built a model with an accuracy of more than 80%.

The best models so far are Random Forest (scaled) / GradientBoosting / Linear / XGBoost

Feature engineering

Filtering

Constant / Quasi constant features

There are no constant features

There are no quasi-constant features
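For reference, a sketch of a constant / quasi-constant filter. The 0.98 threshold and the helper name are assumptions; the notebook's own cut-off is not stated.

```python
import pandas as pd

def constant_report(df: pd.DataFrame, threshold: float = 0.98) -> dict:
    """Split columns into constant (one distinct value) and quasi-constant
    (most frequent value covers at least `threshold` of the rows)."""
    report = {"constant": [], "quasi_constant": []}
    for col in df.columns:
        if df[col].nunique(dropna=False) == 1:
            report["constant"].append(col)
            continue
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            report["quasi_constant"].append(col)
    return report

# Made-up columns: one constant, one nearly constant, one fully varied.
toy = pd.DataFrame({
    "a": [1] * 100,
    "b": [0] * 99 + [1],
    "c": list(range(100)),
})
report = constant_report(toy)
print(report)
```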

Correlation

There is no strong correlation between ['TotRmsAbvGrd', 'GarageArea']
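A check like this can be done by listing every feature pair whose absolute correlation exceeds a cut-off. The sketch below uses synthetic data and an assumed 0.8 threshold; the relationships between the toy columns are invented for the example.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=n)
toy = pd.DataFrame({
    "GarageArea": base,
    "GarageCars": base + 0.1 * rng.normal(size=n),  # nearly a copy of GarageArea
    "TotRmsAbvGrd": rng.normal(size=n),             # independent of the other two
})

pairs = correlated_pairs(toy)
print(pairs)
```

When a pair shows up, one of the two features is usually dropped, typically the one with the weaker relationship to the target.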

Wrapping

Embedded

Feature importance

Our models didn't perform worse, but didn't improve much either, so let's drop more features

Let's try dropping features ['TotRmsAbvGrd','MSSubClass','Fireplaces','2ndFlrSF','MasVnrArea','OpenPorchSF']

Our models didn't perform worse, but didn't improve much either, so let's drop more features: ['OverallCond','BsmtUnfSF','LotFrontage','1stFlrSF']

So far, our model isn't performing worse after removing features and becoming simpler

The model started to perform worse, so I will discard the last change and try to improve it by adding categorical variables
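The drop-and-recheck loop described above can be sketched like this. It assumes scikit-learn's RandomForestRegressor for the importances and 5-fold cross-validated R² as the score; the data is synthetic and the noise_1 / noise_2 columns are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: two informative features and two pure-noise features.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "GrLivArea": rng.normal(size=n),
    "OverallQual": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y = 3 * X["GrLivArea"] + 2 * X["OverallQual"] + rng.normal(scale=0.3, size=n)

def cv_r2(features):
    """Mean 5-fold cross-validated R^2 of a random forest on the given features."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return cross_val_score(model, X[features], y, cv=5, scoring="r2").mean()

# Rank features by impurity-based importance, lowest first.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values()

# Drop the two weakest features and compare scores before and after.
weakest = importance.index[:2].tolist()
kept = [c for c in X.columns if c not in weakest]
score_all = cv_r2(list(X.columns))
score_kept = cv_r2(kept)

print(importance)
print("all features:", round(score_all, 3), "| after dropping:", round(score_kept, 3))
```

If the score after dropping falls noticeably, the last batch of features is restored, which mirrors the decision taken above.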

Dimensionality reduction

Let's try to add some important categorical features and see if it improves our model

After looping through 34 different categorical features, the following features may be relevant:

Let's construct a model containing those features

Hmm, no big improvement yet

Evaluation models

K fold with random forest
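A minimal K-fold evaluation of a random forest, assuming scikit-learn. The data comes from make_regression rather than the housing set, and the 5 shuffled folds and R² scoring are my choices for the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the prepared housing features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Shuffle before splitting so each fold is a random sample of the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

Looking at the spread across folds, and not only the mean, helps when comparing models whose average scores are close.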

So my best model so far is GradientBoostingRegressor

Optimizing hyperparameters

Our best model so far is a random forest with n_estimators=100
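A search over n_estimators could be set up as below. This is a sketch using scikit-learn's GridSearchCV on synthetic data; the candidate values and the 3-fold setting are assumptions, and the winner on this toy problem says nothing about the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the prepared housing features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values are illustrative only.
param_grid = {"n_estimators": [10, 50, 100]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV R^2:", round(search.best_score_, 3))
```

The same grid can be extended with other parameters such as max_depth or min_samples_leaf, at the cost of a longer search.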